ABSTRACT

Recent advances in technology have led to a
dramatic increase in the number of available transcription
factor ChIP-seq and ChIP-chip data sets.
Understanding the motif content of these data sets 
is an important step in understanding the underlying 
mechanisms of regulation. Here we provide a systematic
motif analysis for 427 human ChIP-seq data
sets using motifs curated from the literature and 
also discovered de novo using five established 
motif discovery tools. We use a systematic 
pipeline for calculating motif enrichment in each 
data set, providing a principled way for choosing 
between motif variants found in the literature and 
for flagging potentially problematic data sets. Our 
analysis confirms the known specificity of 41 of 
the 56 analyzed factor groups and reveals motifs 
of potential cofactors. We also use cell type- 
specific binding to find factors active in specific 
conditions. The resource we provide is accessible 
both for browsing a small number of factors and 
for performing large-scale systematic analyses. We 
provide motif matrices, instances and enrichments 
in each of the ENCODE data sets. The motifs dis- 
covered here have been used in parallel studies to 
validate the specificity of antibodies, understand 
cooperativity between data sets and measure the 
variation of motif binding across individuals and 
species. 

INTRODUCTION 

Chromatin immunoprecipitation (ChIP) (1) followed by
hybridization to an array (ChIP-chip) (2,3) or sequencing
(ChIP-seq) (4) enables the genome-wide identification of
the binding locations of transcription factors (TFs)
present in a given condition and cell type or tissue. As
these technologies have matured, their use has become
increasingly widespread. The resolution of these experimental
techniques can be as low as 300 bp for ChIP-chip
(5) and 50 bp for ChIP-seq (6), depending on the experimental
design (e.g. fragment size, paired-end sequencing)
and algorithmic processing of the raw data.

The use of these technologies on a variety of factors 
across many cell types has increasingly highlighted the 
complex nature of TF activity, often violating the simple 
model of a factor binding to its recognition pattern (motif) 
in isolation: binding has been shown to be dynamic across 
cell types, requiring the coordinated binding of cofactors 
or specific configurations of the underlying chromatin. 
Moreover, TF binding frequently occurs in the absence 
of any discernible motif instance (7,8) or to 'hot-spots' 
where several factors are simultaneously found (9). 
Understanding this complex binding necessitates identify- 
ing the underlying sequence features responsible. To 
address this need, we have performed a systematic, 
motif-centric analysis of hundreds of TF binding experi- 
ments made available as part of the human ENCODE 
project (8,10). As part of this, we provide a collection of 
motifs for each assayed factor, both taken from the litera- 
ture and through de novo discovery, and also an annota- 
tion of motif instances genome-wide, which may be 
used to pinpoint the specific regulatory bases in regions 
bound by TFs. 

We found that no single algorithm or database compre- 
hensively assays the motifs relevant to the binding diver- 
sity surveyed by ENCODE. Therefore, our approach was 
to collect motifs from several literature sources (11-16) 
and supplement them with motifs discovered de novo on 
the data sets themselves using five established tools 
(17-21). Although this general approach of using 
multiple motif discovery tools is popular [e.g. (22-24)], 
its application to this number of data sets is unprecedented 
and permits the identification of TFs that are likely to be 
interacting or participating in common pathways. 

This work is accompanied by a web interface for
browsing the discovered and literature motifs along with
their enrichments (Figure 1; http://compbio.mit.edu/encode-motifs).
In addition to the browsing interface, we
provide several data files including all motif matrices and 
their matches to the genome, as well as software to 
compute enrichments and perform unified motif discovery 
with the five tools we use. Together, these permit both
analyses of individual factors (e.g. to identify cooperating
TFs) and systematic analyses (e.g. to examine
differences between TFs). Moreover, the breadth of data
sets available enables systematic comparisons and 
analyses that are not possible when only one or a few 
factors are studied in isolation. 

Later in the text, we describe the details of how the 
resource was generated and conduct an initial analysis to 
provide examples of its usage and to highlight potentially 
interesting results. 



MATERIALS AND METHODS 

Our goals were to produce a resource that (i) contains a 
comprehensive collection of relevant motifs for each 
factor; (ii) avoids repetitive, weakly enriched motifs that 
do not contribute to the in vivo specificity of the factor or 
its partners; and (iii) excludes variants of the same motif, 
particularly among the discovered motifs. With this in 
mind, we conducted motif discovery separately on each 
data set using five motif discovery tools and manually
placed all data sets into 'factor groups' on the basis
of known motifs and homology (Figure 2). Known 
motifs from the literature and the top 10 most enriched 
discovered motifs (excluding duplicates) were collected for 
each factor group (see Supplementary Methods) and 
named as TF_known# for known motifs and TF_disc# 
for discovered motifs, where TF denotes the factor 
group (e.g. FOXA, CTCF, etc.). Known motifs were 
ordered arbitrarily, whereas the discovered motifs were 
ordered in descending order of the enrichment value that 
was used for their selection. 
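
The selection and naming step described above can be illustrated with a minimal sketch. It assumes that each candidate motif already has a single enrichment value and that a duplicate test is supplied by the caller; the function name `select_and_name` and the `is_duplicate` callback are hypothetical and stand in for the procedure detailed in the Supplementary Methods.

```python
from typing import Callable, Dict, List, Tuple


def select_and_name(
    factor_group: str,
    candidates: List[Tuple[str, float]],
    is_duplicate: Callable[[str, str], bool],
    max_motifs: int = 10,
) -> Dict[str, str]:
    """Map TF_disc# names to candidate motif ids.

    candidates: (motif_id, enrichment) pairs pooled for one factor group.
    is_duplicate: hypothetical similarity test between two motif ids.
    """
    # Rank candidates in descending order of enrichment.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept: List[str] = []
    for motif_id, _enrichment in ranked:
        # Skip a candidate that duplicates a higher-ranked kept motif.
        if any(is_duplicate(motif_id, k) for k in kept):
            continue
        kept.append(motif_id)
        if len(kept) == max_motifs:
            break
    # Names follow the TF_disc# convention, numbered from 1.
    return {f"{factor_group}_disc{i}": m for i, m in enumerate(kept, start=1)}
```

Because the numbering follows the ranking, `TF_disc1` is always the most enriched non-duplicate motif under this sketch.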

The 427 ENCODE experiments analyzed correspond to 
123 TFs, which we place into 84 factor groups (Figure 3a). 
We failed to discover an enriched motif for only 12 of the 
84 factor groups, of which 9 lack DNA binding domains 
(BRF, CTBP2, HDAC8, KAT2A, NELFE, SUPT20H, 
SUZ12, WRNIP1 and XRCC4) as identified by UniProt 
(27), and 6 have all their data sets flagged as unreliable 
based on various quality metrics [BRF, KAT2A, NELFE, 
NR4A, SUPT20H and ZZZ3; see (A. Kundaje, L.Y. Jung, 
P.V. Kharchenko, B. Wold, A. Sidow, S. Batzoglou and 
P.J. Park, in preparation)]. Of these factor groups, only 
NR4A has a previously identified known motif. 

We exclude from the discussion below motifs that we 
consider unlikely to be relevant to our analysis, while 
maintaining them as part of the overall resource where 
they may be useful. These include 46 discovered motifs 
that are either low-complexity (e.g. dinucleotide repeats) 
or consistently have weak enrichment (<2) and do not 
match known motifs (Supplementary Table S1). These
are likely a consequence of slight biases in the discovery
pipeline, or are due to real, but relatively weak, specificity
for the factor. We also exclude an additional 36 motifs 
that have a weak similarity to the known motif for the 
factor but for which a better matching and enriched motif 
is also found (Supplementary Table S2). These are most 
frequently seen for longer motifs that can be broken up 
into recognizable, but globally dissimilar, patterns that are 
not captured by our automatic exclusion criteria (see 
Supplementary Methods). Together, these represent 28% 
of the 293 discovered motifs. 



RESULTS 

Using motif similarity metrics, we are able to link the dis- 
covered motifs directly to the TFs that recognize them 
through their known motifs. Here we use these inferred 
relationships between TFs to make specific biological 
insights, illustrating the types of analyses that our 
resource enables. In the interest of clarity, most descrip- 
tions of TFs will be omitted, but may be found along with 
further references at RefSeq (28) and Entrez (29). 

Recovery of known specificity for TFs 

Most of the known literature motifs we collect are derived 
from biochemical in vitro assays. Thus, they provide a 
largely independent, although somewhat imperfect way 
to evaluate the performance of our discovered motifs. 
Recovery of known motifs varies significantly by 
method, but taking the most enriched motif (our 
pipeline) is competitive with the best single method 
(Figure 3b). Overall, our pipeline found a motif 
matching a previously characterized literature motif for 
41 of the 56 factor groups with a known motif. 

One of the most striking observations of this analysis is 
how frequently other distinct motifs were also found. For 
29 of these 41 factor groups other motifs are found, even 
after manually excluding redundant or repetitive motifs, 
and for 9 factor groups one or more of these discovered 
motifs is ranked higher than the motif matching a known 
motif (see Supplementary Table S3). In the next section, 
we will analyze the additional motifs we found for these 
factors, which in many cases identify factors known to 
interact, either cooperatively or competitively. 

For the remaining 15 of 56 factor groups with a known 
motif (e.g. HSF, NANOG, PBX3, SREBP and TAL1) the
known motif is not found at all, including NR4A where 
no enriched motif is discovered. Frequently this is because 
the known motif itself is not enriched and may not accur- 
ately capture the specificity of the factor in vivo. For 
example, the 'known' EP300 motif from Transfac was 
likely built on a specific bound region of EP300 and 
would not accurately capture its binding in all cell types 
where it interacts with a variety of factors and has no 
DNA binding domain of its own (we avoided removing 
such motifs to prevent bias in the database). Likewise, we 
do not discover a motif that matches the known ZBTB33 
specificity, and moreover the known motif itself is not 
enriched at all in the bound regions. 

Although some known motifs were of apparently low
quality, we largely found our database of known motifs to
be relatively comprehensive and had difficulty finding
matches to novel motifs outside it. An exception is
ZNF263_disc1, which does not match a motif in our
database, but does roughly match the specificity for
ZNF263 indicated in (30) despite only having weak enrichment
(1.8-fold).

Figure 2. Outline of motif discovery pipeline. Input regions for each data set are randomly partitioned into two groups. The top 250 regions of one
of the partitions are scanned for motifs using five de novo motif discovery tools. These motifs are evaluated using the peaks from the other
partition and pooled across data sets for a factor group to produce the final list of discovered motifs for each factor group.
[Flowchart labels: for each experiment of a given factor group; randomly split all peaks into two partitions; top 250 of partition #1; discovery tools (AlignACE, MEME and Weeder are legible); take top 3 motifs for each tool; compendium of discovered motifs; all peaks of partition #2; discovered motifs ranked by enrichment; take top 10 non-redundant motifs.]
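
The partitioning step outlined in Figure 2 can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: the peak tuple layout, the even split and the use of a peak score to define the 'top 250' are all assumptions, and `split_peaks` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple

# (chrom, start, end, score); this layout is an assumption for illustration.
Peak = Tuple[str, int, int, float]


def split_peaks(
    peaks: Sequence[Peak], n_discovery: int = 250, seed: int = 0
) -> Tuple[List[Peak], List[Peak]]:
    """Randomly split peaks in two; return (discovery set, evaluation set)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(peaks)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    part1, part2 = shuffled[:half], shuffled[half:]
    # Discovery uses only the strongest peaks of partition #1
    # (ranking by score is assumed here).
    discovery = sorted(part1, key=lambda p: p[3], reverse=True)[:n_discovery]
    # Evaluation uses all peaks of partition #2, which discovery never saw.
    return discovery, part2
```

Keeping the two partitions disjoint means that the enrichment of a discovered motif is measured on regions that did not contribute to its discovery.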

Although the motifs that match each other (either 
known or discovered) generally have similar enrichments, 
in some cases we find substantially higher enrichment for 
some motif variants over others (Figure 4 and 
Supplementary Table S3). For example, NFE2_disc1
matches the known NFE2 motif, but has a 76-fold 
maximal enrichment across NFE2 data sets, compared 
with 56-fold enrichment for the most enriched known 
NFE2 motif. Different known motifs for the same factor 
often show a broad range in enrichment: MEF2 has six 
motifs described in Transfac, with an enrichment differen- 
tial of as much as 4-fold consistently across data sets. This 
enrichment analysis provides a systematic way to choose 
among variants of a motif. 
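
As a minimal sketch of how such a choice could be made, the snippet below picks the variant with the highest maximal enrichment across data sets. The function name `best_variant`, the motif label `NFE2_known_best` and the data set labels are hypothetical; only the 76-fold and 56-fold maxima come from the text, and the remaining values are invented for illustration.

```python
from typing import Dict


def best_variant(enrichments: Dict[str, Dict[str, float]]) -> str:
    """Return the motif whose maximal enrichment across data sets is highest.

    enrichments[motif][data_set] holds a fold enrichment.
    """
    return max(enrichments, key=lambda m: max(enrichments[m].values()))


# Hypothetical table: maxima of 76 and 56 are from the text, the rest is made up.
nfe2 = {
    "NFE2_disc1": {"dataset_A": 76.0, "dataset_B": 40.0},
    "NFE2_known_best": {"dataset_A": 56.0, "dataset_B": 35.0},
}
chosen = best_variant(nfe2)
```

Other summaries (for example the median across data sets) would be equally easy to substitute; the maximum is used here because it is the statistic quoted in the text.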

We also saw varying enrichment of the known motif, 
depending on the specific data set for a factor group. For 
example, CTCF_known2 is enriched in CTCF data sets in 
a range from 30- to 78-fold on identically processed data. 
This may be a result of varying quality of the samples 
across data sets or may be a consequence of true biological 
differences. 

Identifying the sequence specificity for factors that were 
previously uncharacterized is of particular interest. In all, 
17 factor groups had no known motif but now have dis- 
covered enriched motifs (BCL, BDP1, CCNT2, CHD2, 
CTCFL, HDAC2, HMGN3, RAD21, SETDB1, SIRT6, 
SMARC, SMC3, SP2, SIN3A, THAP1, TRIM28 and 
ZNF263). These discovered motifs may represent the 
direct or indirect (e.g. through cofactors) DNA binding 
specificity. 



Shared motifs suggest interacting relationships 

We find that most factors have motifs for other factors 
enriched in their binding sites (summarized in 
Supplementary Table S4). This may occur due to (i) co- 
operative binding of the two factors to the same locations; 
(ii) interfering binding between factors where one binds 
near the other to prevent binding; (iii) some similarity in 
motif specificity; (iv) the two factors functioning on a 
similar set of genes (e.g. ones specific to one tissue), 
without directly interacting; or (v) the factors binding to 
similar genomic regions (e.g. near genes). Our analysis 
does not directly rule out any of these possibilities; 
however, (iii) is generally verifiable using our motif simi- 
larity metrics and (v) can be examined by inspecting only 
the TSS-proximal enrichment. 

The motif most enriched in multiple data sets was the
TPA DNA response element (TRE; TGA[C/G]TCA),
which is recognized by the AP1 TF when it is formed by
FOS/JUN dimers (31) and other factors including MAF
and NFE2. The enrichment of the TRE in a data set is
often stronger than that of even the known in vitro
sequence specificity and may arise from a number of phenomena,
including (i) a cooperative interaction with
AP1, (ii) competition with AP1 for the same binding
sites, leading to a potentially repressive role for the TF
or (iii) reuse of binding sites due to, for example, accessibility
of chromatin. We find a motif matching the TRE
motif for 20 factor groups (AP1_disc3, AP2_disc1,
BATF_disc1, BCL_disc2, CTCF_disc8-9, EP300_disc1,
GATA_disc2, HMGN3_disc1, IRF_disc2, MAF_disc1,
MEF2_disc3, MYC_disc3, NFE2_disc1, NR3C1_disc2,
PRDM1_disc2, RXRA_disc3, SMARC_disc1, STAT_disc2,
TCF7L2_disc1 and TRIM28_disc1).
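
To make the TRE consensus concrete, the sketch below locates exact matches to TGA[C/G]TCA in a DNA string. It is a consensus scan only, not the matrix-based matching used for the resource, and the function names are hypothetical.

```python
import re
from typing import List


def consensus_to_regex(consensus: str) -> str:
    """Turn bracket notation such as 'TGA[C/G]TCA' into a regex ('TGA[CG]TCA')."""
    return consensus.replace("/", "")


def find_sites(sequence: str, consensus: str) -> List[int]:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead is used so that overlapping occurrences are all reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


TRE = "TGA[C/G]TCA"
hits = find_sites("AATGACTCAGGTGAGTCATT", TRE)
```

The two sequences allowed by the TRE consensus (TGACTCA and TGAGTCA) are reverse complements of each other, so scanning one strand with this pattern also finds sites on the opposite strand.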

We found the enrichment of the TRE to be particularly
notable for a few factors. GATA and AP1 have
known cooperative binding (32). TFs in the SMARC
factor group are members of the SWI/SNF chromatin
remodeling complex (33), which is necessary for proper
regulation by FOS/JUN dimers (34); and TCF7L2_disc1,
which matches the TRE, is more enriched than
the known TCF7L2 motif (TCF7L2_disc2) in only the
TCF7L2 colorectal cancer cell line HCT-116 data set, consistent
with the known interaction of JUN and TCF7L2
during intestinal cancer development (35).

Figure 3. (a) Summary of input data used. The outside ring indicates
the experimental data sets (one tick for each of 427), which are
separated into 123 transcription factors (second ring). The TFs are
further grouped into 84 factor groups (third ring). We are able to
find a matching discovered motif for 41 of the 56 factor groups with
a known motif; 29 of these 41 factor groups have additional discovered
motifs that may be associated with cofactors. For all but 1 of the 15
factor groups where the known motif is not recovered we still find
enriched discovered motifs. We also discovered enriched motifs for 17
of the 28 factor groups without a known motif. (b) Recovery of known
motifs by each of the discovery tools. Performance of discovery in
terms of number of factor groups for which the known motif was recovered.
A motif is considered a match if it matches any of the known
motifs for a factor group (see Supplementary Methods for details on
how matches are computed). The number of additional factors that
have a match is shown with each additional motif (only three motifs
are taken from each individual method, whereas we have up to 10 for
the pipeline). The number of factor groups with no motif match is
shown in parentheses. When multiple data sets exist for a factor
group, the fraction that matches is used as its contribution
when computing the performance of the individual tools.

Figure 4. Comparison of known versus discovered motifs (selected
where discovered better enriched than known; all factor groups with
a discovered motif matching a known motif in Supplementary Table
S3). Displayed are the known and discovered motifs with the maximum
enrichment across all data sets for a factor group. Only the discovered
motifs that match a known motif for a factor group are considered.
The maximum enrichment is indicated for each factor and, in parentheses,
the 'raw' enrichment for the same data set without the use of the
shuffled motifs for correction.

AP1 also binds to the cAMP response element (CRE;
TGACGTCA) when the dimer is formed by ATF3/JUN
(31) and this is the motif we find as AP1_disc1.
However, AP1_disc3 (which matches the TRE) is the
most enriched motif in FOS data sets. Interestingly,
ATF3_disc1 is not the CRE, but rather the E-box (see
later in text). We do, however, find a variant of the
CRE (with additional specificity) as ATF3_disc2. The
most enriched discovered motif for E2F, E2F_disc1, also
matches the CRE and is highly enriched in all data sets.

MYC is a critical regulator, which recognizes the E-box 
sequence. To aid in comparisons, we include MAX, which 
forms complexes with MYC, and USF1/2, which also rec- 
ognizes the E-box sequence, in the MYC factor group. 
We find multiple motifs enriched in MYC binding sites,
highlighting the multifunctional role MYC and the other
E-box recognizing proteins play. We found a version of
the E-box with additional specificity (MYC_disc1) that
was highly enriched in USF1/2 bound regions (max 98-fold
for USF2 versus <9-fold enrichment for MYC/MAX).
This motif was more enriched than the known
E-box motifs, including known USF motifs, in many
USF data sets. We find a second, less specific E-box
motif (MYC_disc2), which shows more even enrichment
across factors. We also find discovered motifs of other
factors matching the E-box, including SIN3A_disc2 (discussed
later in text), NFE2_disc2-3 and SIRT6_disc1. It is
notable that although SIRT6 is a chromatin-associated
protein without a known DNA binding domain (36), the
only discovered motif matches the E-box (with 16-fold
enrichment in SIRT6 bound regions), suggesting that
MYC or another E-box recognizing factor may play an
important, but indirect, chromatin-related role.

Motif enrichment is able to identify both positive and 
negative interactions for the same factor. For example, 
SIN3A, a corepressor known to interact with a number 
of proteins, has discovered motifs matching REST 
(SIN3A_disc1 and more weakly disc3-4) and MYC
(SIN3A_disc2). These are consistent with SIN3A's 
known involvement in repression by REST (37) and 
SIN3A being a known antagonist for MYC (38). 

Moreover, MYC_disc4 matches RFX5 and is enriched
particularly for MAX-bound regions in H1-hESC and
GM12878, and MYC_disc5 matches the CEBPB known
motif and is enriched in MYC regions bound in unstimulated
K562 cells. MXI1, which was not included in the
MYC factor group although it does interact with MAX
to bind to MYC-MAX sites (39), has a discovered motif,
MXI1_disc1, that matches RFX5 in both the K562 and
HeLa-S3 cell lines.

We analyzed six IRF family data sets: IRF1 binding in
K562 cells stimulated by IFNa (viral innate response) or
IFNg (viral, bacterial and tumor control); IRF3 in
HepG2, GM12878 and HeLa-S3; and IRF4 in
GM12878. The most strongly enriched motif
(IRF_disc1, matching NFY) is highly enriched (>20-fold)
for all three IRF3 data sets and IRF1 in K562
under IFNg stimulation. This suggests that binding of
IRF to NFY sites occurs only under specific conditions
and by only some IRF members and potentially expands
on the previously documented interaction of NFY and
IRF2 at a single promoter (40). IRF_disc4, which
matches SP1, is enriched in the same cell types, albeit at
much lower levels. IRF_disc3, which matches the known
IRF consensus, shows weak-to-no enrichment in these
data sets, but shows an enrichment of 8.8-fold for IRF1
bound regions in K562 cells under IFNa stimulation and
3.1-fold enrichment for IRF4 bound regions in GM12878.
IRF_disc2, which matches the TRE, is enriched primarily
in GM12878 regions bound by IRF4. The known SPI1
motif matches IRF_disc5, and reciprocally SPI1_disc2
matches the IRF motif, consistent with the importance
of SPI1 in hematopoietic development (41).

Beyond the discovered motif for IRF, several other discovered
motifs (AP1_disc2, CEBP_disc2, E2F_disc4,
PBX3_disc1, RFX5_disc2 and SP1_disc1-2) match the
known NFY specificity (CCAAT). These discovered
motifs are consistent with several known interactions of
NFY. RFX5 promotes the cooperative binding between
RFX and NFY (42), CEBPB and NFY interact in at least
one promoter (43) and SP1 and NFY are known to
interact (44). E2F_disc4 has particularly high enrichment
in E2F4 data sets, consistent with the cooperative role
E2F4 and NFY play in cell cycle regulation (45).

STAT factors are involved in regulating a number of
growth-related functions. We analyze STAT1, STAT2
and STAT3 here in the context of GM12878, HeLa-S3,
MCF10A-Er-Src and K562 cells. We find relatively consistent
enrichment of the STAT full site (TTCCNGGAA),
which STAT_disc1 matches, while finding weak enrichment
for just the half-site (TTCC). We also find motifs
involved in other proliferative functions including
STAT_disc2, which is particularly enriched in STAT3
data sets and matches the TRE, consistent with STAT3
being one of the many interaction partners for AP1 (46).
STAT_disc3 matches the IRF consensus and has enrichment
that is particularly high in STAT1 and STAT2 data
sets stimulated by IFNa, highlighting the cooperativity of
STAT factors and IRF in immune functions. STAT_disc4
is a match to the CEBPB motif and is found enriched in
STAT3 data sets, consistent with the known cooperative
role for these two factors (47).

TFs with ETS domains are highly conserved and 
involved in several cellular processes [reviewed in (48)]. 
A number of TFs have discovered motifs that match the
ETS consensus, including EGR1_disc2, GATA_disc3,
MEF2_disc2, NRF1_disc2, NR2C2_disc1 and
PAX5_disc4. These discovered motifs are supported by
known interactions between GATA and ETS in sea 
squirts (49), MEF2 and the ETS factor PEA3 (50) and 
NR2C2 with the ETS factor ELK4 (51). Moreover, 
PAX5 and ETS factors have shared roles in the develop- 
ment of B-cells (52,53). Looking at the discovered ETS 
motifs, we find that ETS_disc8 matches the known motif 
for MYB and the two have been known to cooperate, a 
relationship that is important in the context of certain 
cancers (54). 

THAP1 has two discovered motifs, both of which 
match the known YY1 motif (the first with additional 
specificity added by an apparent HNF4 motif). To our 
knowledge, the relationship between THAP1 and YY1 
has not been directly observed; however, THAP1 has 
been known to associate with the coactivator HCF-1 
(55), and YY1 and HCF-1 are known to interact (56). 
Our result suggests that THAP1 and YY1, possibly with 
the addition of HNF4, may interact at least in the K562 
cell line for which we have THAP1 binding data. 
RAD21_disc3 also matches YY1, suggesting an additional 
interaction. 

NANOG, an important pluripotency TF, has a known 
motif that is only weakly enriched (1.3-fold) in the bound 
regions and not discovered by our pipeline. We see much 
stronger enrichment for the known POU5F1 and 
POU2F2 motifs, for which we also find similar motifs 
(NANOG_disc2 and NANOG_disc4, respectively), con- 
sistent with their shared roles in pluripotency (57,58). 
The interaction of these factors is further supported by 
POU5F1_disc2 matching the known POU2F2 motif.
Additionally, NANOG_disc2 and disc3 match the 
known motifs for TCF7L2 and TCF12, respectively, 
again consistent with the important role TCF proteins 
play in stem cells (59). 

CTCF plays a variety of vital roles in the organization 
of chromatin architecture (60) and the motifs we discover 
matching the known CTCF specificity (RAD21_disc1,
SMC3_disc1,2-4, CTCFL_disc1,10, ZBTB7A_disc1,2,
SP2_disc3 and RXRA_disc2,5; some weakly) are largely
compatible with this role. RAD21 is a highly conserved 
protein involved in DNA double-strand repair (61) known 
to co-localize with CTCF (62). Cohesin, of which SMC3 is 
a subunit, is brought to the chromatin by CTCF (63). 
Further, although the function of the CTCF paralog 
CTCFL is not completely known, it does appear to be 
involved in imprinting through interaction with a 
histone methyl transferase (64). 

Combinations of motifs 

A few of the discovered motifs contain additional specifi- 
city or have distinct segments matching multiple motifs. 
For example, EGR1_disc4 appears to be a combination of
multiple motifs (EGR1, IKZF1 and a homeobox motif),
and SETDB1_disc1 contains the ZNF143 core sequence
with significant additional specificity. The appearance of 
these motifs suggests highly specific 'grammars' for these 
motifs that may require specific spacing and orientation of 
binding sites for functionality. 

We find several additional enrichments of potential
interest. PBX3_disc2 matches the known MEIS1 motif,
consistent with the known cooperative binding of
MEIS1 and PBX (65). TAL1_disc1 matches GATA,
with the potential connection that GATA and TAL1 are
known to be important in hematopoiesis and vascular development
(66,67). HSF_disc1 matches the known CEBP
motif and has much higher enrichment in HSF data sets
(31-fold) compared with the known motifs for HSF (<9-fold).
Additionally, EGR1_disc5, HNF4_disc5,
NRF1_disc3, PAX5_disc2, RXRA_disc4/PAX5_disc3
and SREBP_disc1 match the known motifs for ZIC,
SOX, SP1, PAX2/PAX3, IRF and RFX5, respectively,
suggesting additional previously uncharacterized interactions.
Lastly, we find some motifs that show more ambiguous
matches: SMARC_disc2 shows weak similarity to
the homeobox TGTAGT motif, NR2C2_disc2-3 weakly
matches the known HNF4 motif and EGR1_disc3/SETDB1_disc2
matches the repetitive NRF1 motif.

General factors enriched in cell line-specific key regulators 

Factors directly responsible for the establishment of en- 
hancers, chromatin restructuring or polymerase recruit- 
ment frequently exhibit binding that is highly cell type 
specific. Because most of these factors do not have their 
own sequence specificity, their binding is often correlated 
with that of regulators important for the specific cell line. 
We analyze several such factors (BCL, BDP1, CCNT2,
EP300, FOXA, HDAC2, HMGN3, TATA, TCF12 and
TRIM28) and find that key cell line regulators can be
identified by examining enrichments in cell line-specific
data sets.

As a transcriptional coactivator, EP300 interacts with 
numerous TFs [reviewed in (68)] and has been shown to 
have binding that can identify tissue-specific enhancers 
(69). Conversely, FOXA has a DNA binding domain 
and plays an important role in liver development and 
function (70) and is a pioneer factor responsible for 
priming chromatin for the binding of other factors 
[reviewed in (71)]. Other proteins involved in chromatin
restructuring include HDAC2, which transcriptionally 
represses through histone deacetylation (72) and 
HMGN3 (73). Further, two factor groups are directly 
involved in transcription including three RNA Pol3 
subunits (BDP1, RPC155 and TFIIIC-110) and CCNT2, 
which is involved in the elongation of Pol2 (74). 

Eight of these ten factor groups have at least one data
set in K562 (erythroleukemia cells), and for four of these
we discover motifs that match the GATA consensus,
which is then enriched specifically in the K562 data sets
(BCL_disc5, CCNT2_disc1, HDAC2_disc1 and
HMGN3_disc2). GATA has a known important role in
K562 (75), and we also have previously found an association
between GATA motifs and chromatin state-derived enhancers
for K562 cells (76). We also find three additional
motifs that have enrichment specific to the factor group's
K562 data set: BDP1_disc1, a 23-nt motif that contains
the STAT consensus; HMGN3_disc1, which matches the
TRE; and TRIM28_disc2, which matches no known motif
and may be associated with an uncharacterized regulator
active in this cell line.
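
One way to flag such cell line-specific enrichments is sketched below. This is not the criterion used in our analysis: the function name `cell_line_specific`, the 2-fold minimum enrichment and the 2-fold margin over the next-best cell line are all assumptions chosen for illustration, and the example values are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cell_line_specific(
    enrichments: List[Tuple[str, str, float]],
    min_enrichment: float = 2.0,  # assumed threshold
    min_ratio: float = 2.0,       # assumed margin over the runner-up cell line
) -> Dict[str, str]:
    """Map each motif to the single cell line in which it stands out.

    enrichments: (motif, cell_line, fold enrichment) records, one per data set.
    """
    per_motif: Dict[str, Dict[str, float]] = defaultdict(dict)
    for motif, cell_line, value in enrichments:
        # Keep the best data set when a cell line was assayed more than once.
        previous = per_motif[motif].get(cell_line, 0.0)
        per_motif[motif][cell_line] = max(value, previous)

    specific: Dict[str, str] = {}
    for motif, by_line in per_motif.items():
        if len(by_line) < 2:
            continue  # specificity is undefined with a single cell line
        ranked = sorted(by_line.items(), key=lambda kv: kv[1], reverse=True)
        (top_line, top), (_, runner_up) = ranked[0], ranked[1]
        # Values below 1 are treated as no enrichment when forming the ratio.
        if top >= min_enrichment and top >= min_ratio * max(runner_up, 1.0):
            specific[motif] = top_line
    return specific


# Invented values for illustration only.
example = [
    ("HDAC2_disc1", "K562", 6.0),
    ("HDAC2_disc1", "HepG2", 1.1),
    ("HDAC2_disc2", "HepG2", 5.0),
    ("HDAC2_disc2", "K562", 1.0),
]
flags = cell_line_specific(example)
```

Motifs that are similarly enriched in every assayed cell line are not flagged, which separates them from the cell line-specific cases discussed here.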

Likewise, for GM12878, an EBV-mediated
lymphoblastoid cell line, we find three discovered motifs 
(BCL_disc4, EP300_disc5 and TCF12_disc4) that match 
the known IRF consensus. IRF4 has been shown to be 
important in the establishment of these cell lines (77), and 
the family is an important player in immune cells (78). 
This enrichment is also consistent with our previous 
study using epigenetic marks (76), where we found IRF 
to be the strongest enriched motif in GM12878-specific 
enhancers. We also find GM12878-specific enrichment 
for motifs matching NFKB (BCL_disc6) and POU2F2 
(TATA_disc9), consistent with the known biology of 
these factors (79,80). 

The motifs we find specifically enriched in HepG2 (liver 
carcinoma) data sets match the known motifs for FOXA 
(EP300_disc3, HDAC2_disc2, and TCF12_disc2), HNF4 
(FOXA_disc5 and HDAC2_disc5) and CEBP 
(EP300_disc2,6), three key liver regulators (70,81). We 
find motifs with enrichments specific to H1-hESC, which
include matches to the pluripotency factor POU2F2
(TATA_disc9), the near universally expressed repressor
REST (BCL_disc3 and HDAC2_disc4) and key metabolic
regulator NRF1 (HDAC2_disc4). We find additional cell
line-specific enrichments for FOXA_disc3 (TCF12) in
ECC-1, FOXA_disc4 (STAT) in both T-47D and ECC-1
and EP300_disc2,6 (CEBP) and EP300_disc4 (ETS) with
enrichment in the HeLa-S3 data set.

Even for these factors, we find motifs that are consistently
enriched across assayed cell lines for a given factor.
FOXA_disc1, for example, matches the known FOXA
motif, indicating that FOXA's own motif also plays an
important role in its specificity. Most of the motifs we
identify for RNA Pol2 machinery (TAF1, GTF2B,
GTF2F1 and TBP) are enriched in all cell lines, including
the known TATAAA motif (TATA_known2). Also,
TATA_disc1, disc6 and disc8 have consistent enrichment
and match the known motifs for YY1 (which is known to
be important in establishing transcription) (82), NFY and
ETS. The top discovered motif BCL_disc1 matches the
known ETS motif and is also enriched across data sets.

Interestingly, we find that the TRE motif is found and
enriched in a cell line-specific manner for several factors,
but for different cell lines. For example, HMGN3_disc1 is
enriched in K562, BCL_disc2 has the highest enrichment
in GM12878, TRIM28_disc1 is only enriched in the
HEK2932 and U2OS cell lines and EP300_disc7 has enrichment
in the neuroblastoma cell line SK-N-SH-RA and
HeLa-S3. This suggests that perhaps AP1 or other factors
recognizing TRE are selectively interacting with these
proteins depending on the cell line.

Novel motifs raise possibility of unknown regulators 

Although we are able to putatively explain the majority 
of the motifs we discover as either matches to previously 
known motifs or low complexity sequences, we do identify 
30 putative novel motifs (Figure 5). We placed 
these into eight groups on the basis of their similarity: 
Novel1 (BRCA1_disc1, CHD2_disc1, ETS_disc3,6, 
NR3C1_disc3 and ZBTB33_disc1-4), Novel2 (EGR1_disc4, 
ETS_disc1,5,7, SETDB1_disc1, SIX5_disc1-3, 
SMARC_disc2 and ZNF143_disc1-3), Novel3 
(SP2_disc3, TCF12_disc3 and ZBTB7A_disc2), Novel4 
(RFX5_disc3), Novel5 (BDP1_disc2), Novel6 
(TATA_disc5,7), Novel7 (TRIM28_disc2) and Novel8 
(E2F_disc6). 
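
Grouping motifs by similarity can be illustrated with a small sketch. This is not the procedure used in this study. It assumes each motif is a list of per-position base frequency columns (A, C, G, T), scores a pair by the best mean column-wise Pearson correlation over ungapped alignments in either orientation, and joins motifs whose score reaches a threshold (single linkage). The function names, the 0.75 threshold, the minimum overlap of 4 and the toy consensus sequences are all illustrative assumptions.

```python
from itertools import combinations
from math import sqrt

BASES = "ACGT"

def from_consensus(seq, major=0.85):
    """Hypothetical helper: frequency columns (A, C, G, T) from a consensus string."""
    minor = (1.0 - major) / 3.0
    return [[major if b == base else minor for b in BASES] for base in seq]

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def revcomp(motif):
    # Reversing both column order and base order (ACGT -> TGCA)
    # yields the reverse-complement motif.
    return [col[::-1] for col in reversed(motif)]

def similarity(m1, m2, min_overlap=4):
    """Best mean column correlation over all ungapped offsets and both strands."""
    best = -1.0
    for cand in (m2, revcomp(m2)):
        for off in range(-(len(cand) - 1), len(m1)):
            pairs = [(m1[i], cand[i - off])
                     for i in range(len(m1)) if 0 <= i - off < len(cand)]
            if len(pairs) >= min_overlap:
                score = sum(pearson(a, b) for a, b in pairs) / len(pairs)
                best = max(best, score)
    return best

def group_motifs(motifs, threshold=0.75):
    """Single-linkage grouping of a {name: motif} dict using union-find."""
    parent = {name: name for name in motifs}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in combinations(motifs, 2):
        if similarity(motifs[a], motifs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in motifs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())

# Toy input: m2 contains m1 with one flanking base on each side; m3 is unrelated.
motifs = {
    "m1": from_consensus("TGACTCA"),
    "m2": from_consensus("ATGACTCAT"),
    "m3": from_consensus("GGGGCGGGG"),
}
groups = group_motifs(motifs)
```

With these toy inputs, `groups` is `[['m1', 'm2'], ['m3']]`. Single linkage is only one possible choice; it can chain loosely related motifs together, so a stricter threshold or a different linkage may be preferable on real data.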

Novel1 (using ZBTB33_disc1) is highly enriched in at 
least one data set for each of the factor groups for which 
it is found (BRCA1, CHD2, ETS, NR3C1 and ZBTB33). 
All five factor groups except CHD2 have at least one 
known motif, and for each of these factor groups Novel1 is 
more enriched in at least one data set than any known 
motif [the result for NR3C1 is questionable because only 
one data set has enrichment and that data set has been 
independently flagged as problematic; see 
http://www.encodeproject.org/encode/qualityMetrics.html]. The 
shared role of BRCA1 and CHD2 in DNA damage 
repair (83,84) suggests that Novel1 may be involved in 
this or other shared roles for these factors and highlights 
the utility of shared motif enrichment even outside of motifs 
directly tied to a factor. 

Similarly, for SIX5, we see only weak enrichment of the 
known SIX5 motif and fail to discover a motif similar to 
it. However, Novel2 (using SIX5_disc1) shows over 100- 
fold enrichment for all three data sets (K562, GM12878 
and H1-hESC). Novel2 also shows high enrichment in 
data sets for which it was not rediscovered, including 
ATF3 (all data sets have >20-fold enrichment with 
GM12878 having 106-fold) and NRF1 (all data sets 
have >30-fold enrichment). Moreover, the known 
ZNF143 motif, which is 4-fold enriched in the one 
ZNF143 data set, is also not recovered, but Novel2 is 
24-fold enriched. The breadth of data sets sharing this 
motif suggests it may be recognized by an important yet 
unknown or under-characterized regulator. 

Like the known ZBTB7A motif, Novel3 (using 
SP2_disc3) is largely poly-G, which causes us to underesti- 
mate its enrichment due to our shuffling process. Despite 
this, however, it does show enrichment in several data sets, 
including for the factor groups for which it was identified. 
This motif shows similarity to other poly-G motifs, such 
as known SP1 motifs, but appears to be distinct due to its 
other bases. 
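
A rough calculation shows why a shuffled control penalizes low-complexity motifs. Assume, for illustration only, that a control motif is produced by randomly permuting the positions of the original motif. A position then keeps its original base with probability n_b/L, where n_b is the number of positions carrying that base and L is the motif length, so the expected fraction of unchanged positions is the sum over bases of (n_b/L)^2. The sketch below works on consensus strings rather than full matrices; the function name and both example sequences are hypothetical.

```python
from collections import Counter

def expected_unchanged_fraction(consensus):
    """Expected fraction of positions whose base is unchanged after a
    uniformly random permutation of the consensus: sum_b (n_b / L)^2."""
    length = len(consensus)
    counts = Counter(consensus)
    return sum(n * n for n in counts.values()) / (length * length)

# Hypothetical examples: a mostly-G motif versus one with two of each base.
poly_g = expected_unchanged_fraction("GGGGGGGA")         # (7^2 + 1^2) / 8^2
complex_motif = expected_unchanged_fraction("TGACTCAG")  # 4 * 2^2 / 8^2
```

Here `poly_g` is 0.78125 and `complex_motif` is 0.25: under this assumption a shuffled poly-G control still looks much like the original, so it matches many of the same sites and the measured enrichment relative to the control is pulled toward 1.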

Novel4 (RFX5_disc3) shows moderate, but consistent 
(2- to 6-fold) enrichment across the RFX5 data sets. The 
consensus is composed of two of the same components as 
the known motifs (AAC and TGA), but ordered differ- 
ently. Consequently, it may represent the binding specifi- 
city of, for example, an alternative isoform of RFX5. The 
remaining motifs (Novel5-8) were found for factors that 
show cell line-specific enrichments. Consequently, these 
may represent specificities for regulators that are previ- 
ously unidentified. 

Experimental and evolutionary validation of novel motifs 

Following the motif discovery and selection of these 
putative novel motifs, a study released hundreds of new 
motifs generated using high-throughput SELEX (16). Two 
of the putative novel motifs described in this section match 
motifs generated by (16): Novel1 matches the motif 
for ETV6 and Novel6 matches ZBED1. Although we 
have incorporated these SELEX motifs into our 
resource, we continue to include Novel1 and Novel6 as 
putative novel motifs because they were identified 
without knowledge of these new specificities and thereby 
strengthen the evidence for the remaining novel motifs. 

Four of these putatively novel motif groups (Novel1-3 and 
Novel6) match motifs that were previously identified using 
conservation signals across four mammals (85) 
(Supplementary Table S5). Therefore, this study 
provides additional support for these conservation-based 
motifs and, conversely, the motifs identified here gain 
comparative evidence. The relatively few distinct novel 
patterns found in this study, and the comparative 
support for many of those few, suggest that there may be 
only a limited number of still-unknown human TF motifs 
that have many instances and interact with one of the 
assayed factors. 



DISCUSSION 

In this article, we provide a systematic and comprehensive 
collection of motifs for hundreds of human TF binding 
data sets. TF binding can be complex, with a factor 
recognizing several motifs or binding in the apparent 
absence of any motif [reviewed in (86)]. We also show that 
it is possible to identify cofactors that may be partially 
responsible for binding or function. 

This motif resource has already been used in several 
articles while this article was in preparation, demon- 
strating its value for high-throughput analyses. Our 
motifs are being matched at low stringency to identify 
peaks that are devoid of any motif to understand the mech- 
anism through which motif-less peaks are generated (8). 
The collection of known motifs and the enrichment tech- 
niques we present here were also used as a secondary 
validation of peaks (87). Because having the motifs 
allows for more precisely determining the bases 
responsible for binding, these motifs enable analyses 
involving population data (88) and the interpretation of 
GWAS data (89). Two other ENCODE articles also 
perform motif discovery: (90) produce a non-redundant 
list of discovered motifs but do not perform an extensive 
analysis of the relationships between factors and (91) use 
DNaseI footprinting data to identify relevant motifs. 

Having a motif catalog is also the first step in identify- 
ing high-quality computational targets of factors, which 
may allow the identification of binding sites that were, for 
example, not found in the conditions assayed. Two 
popular strategies are used for this purpose. One is using 
clustering of motif instances for factors known to cooper- 
ate to form cis-regulatory modules (92,93). This resource 
is well suited for this purpose because it naturally provides 
sets of motifs that are likely to cooperate. 
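
The clustering idea can be sketched as a simple window scan. The code below is a minimal illustration, not the method of the cited studies: it assumes motif instances on one chromosome are given as (position, motif name) pairs and reports regions where instances of at least `min_factors` distinct motifs fall within `window` bases of one another. The function name, parameter defaults and toy coordinates are hypothetical.

```python
def find_clusters(instances, window=200, min_factors=2):
    """Return (start, end, motif_names) for regions in which instances of at
    least `min_factors` distinct motifs lie within `window` bases of a
    starting instance. Overlapping regions are merged."""
    hits = sorted(instances)
    clusters = []
    for i, (start, _) in enumerate(hits):
        # All instances from this one up to `window` bases downstream.
        members = [h for h in hits[i:] if h[0] - start <= window]
        names = {name for _, name in members}
        if len(names) < min_factors:
            continue
        end = members[-1][0]
        if clusters and start <= clusters[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            prev_start, prev_end, prev_names = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end), prev_names | names)
        else:
            clusters.append((start, end, names))
    return clusters

# Toy instances: motif names "A", "B", "C" stand in for cooperating factors.
instances = [
    (100, "A"), (150, "B"), (180, "A"),   # A and B close together
    (5000, "A"), (5050, "A"),             # one motif only: not a module
    (9000, "B"), (9100, "C"),             # B and C close together
]
clusters = find_clusters(instances)
```

For the toy input, `clusters` contains two regions, 100-180 with motifs A and B and 9000-9100 with motifs B and C. The inner list comprehension rescans the remaining instances for every start, which is fine for a sketch but would be replaced by a two-pointer scan on genome-scale data.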

A second approach is the use of conservation across many 
closely related species (85,94-97). This can be performed 
readily on these motif instances because a dense tree of 
mammalian species has been sequenced, readily permitting 
their alignment and measuring selection at a near-nucleo- 
tide level. Because changes in the underlying motif 
matches are largely responsible for changes in binding 
across species (98), evolutionary-based approaches on 
the motif instances may be a means to deal with the 
high rate of non-functional binding (99-101). 